ÁRBOL guided tutorial

Authors: Kyle Kimler and Ben Doran

Compiled: September 11, 2022

ARBOL iteratively explores axes of variation in scRNA-seq data by clustering and subclustering until variation between cells in subclusters becomes noise. The philosophy of ARBOL is that every axis of variation could be biologically meaningful, so each should be explored. Once every possibility is explored, curation and a statistical interrogation of the resolution are used to collapse clusters into the elemental transcriptomes of the dataset. ARBOL inherently builds a tree of subclustering events. As data is separated by the major axes of variation in each subset, further rounds capture less pronounced variables. Variation shared by all celltypes make up one of the major axes of variation in the first round of clustering, for example, inflammatory signatures. Celltypes can split up at the beginning, or the same splitting of celltypes (e.g. B and T cells) might happen further down in separate branches. Construction of a taxonomy of cell types, subtypes, and states, annotated automatically using a cell-state naming method that captures the unique markers of the end clusters among a user-defined clade, provides an understanding of relationships between the elemental transcriptomes of the dataset.

ARBOL was built to be a modular software. It iterates through the default Seurat analysis pipeline, with two additions that enable automation:

The main use case for this software is to help address the need to manually subcluster scRNA-seq data analyzed in the Seurat framework by providing tunable elements that can be consistently and unbiasedly applied to large-scale datasets. It is complementary to other packages such as scPanoView and iterclust built for specific clustering methods outside of Seurat. The ARBOL() iterative subclustering function is also available in scanpy python ARBOLpy and requires approximately half as much RAM and time to complete. All visualization methods used below are currently only available in R

Load ARBOL and data

Installation of ARBOL is currently done via install_github()

finally, load a seurat object for ARBOL. This tutorial uses the following model dataset:

Perform ARBOL tiered hierarchical clustering

Load ARBOL data

ARBOL() can take a long time. endclustDF provides a dataframe of tierN membership per cell, but you can also read in tiers output from the endclusts folder. You can also read the file "ARBOL_output/polyp_sctransform/endclusts.csv" to get tierN membership per cell as a dataframe.

For ARBOL downstream analysis, you must provide a "sample" and "tierNident" (tierNident is included in the tiers dataframe in the previous cell) column in the Seurat object metadata

Visualize ARBOL

You can build and visualize a tree of clustering events using the function ARBOLcentroidTaxonomy(). Tree-building methods in ARBOL currently support categorical, count, and diversity data, adding these attributes to each node of a data.tree object and a ggraph object. These objects and metrics associated with them are added to the @misc slot of the seurat object.

ggraph supports diverse methods for visualization. Here, the clustering tree is visualized with heights corresponding to tier using layout='tree'. polyp status is plotted per edge to get a rough visualization at which tier clustering separates disease status.

See https://www.data-imaginist.com/2017/ggraph-introduction-layouts/ for a tutorial on ggraph.

ggraph allows addition of node points and text. Once we have rough celltype labels for each cell, we can plot these as text per end-cluster, and color the branches according to where these celltypes are divided in ARBOL(). Here, layout='dendrogram' allows placement of these points and text on only the "leaf" nodes. We like to plot the exact number of cells in each leaf-node alongside celltype membership, as visualizing sizes directly is often uninformative (as seen below)

Instead of node sizes, we like to plot sample diversity as colored boxes per node. This gives us an idea of where clustering finds sample-specific modules

ggraph supports a diverse array of visualization layouts for our trees

Calculate a taxonomy of end-clusters

In analysis of ARBOL tree-structure, we have found that clustering trees are not reflective of cluster relationships. To build a tree that reflects these relationships, we calculate distances between each clusters' feature-centroids in unreduced gene space or in a latent space.

ARBOL tree-building functions calculate a taxonomy of end-cluster relationships using the pvclust() package. Tree-building is wrapped in the function ARBOLcentroidTaxonomy(), which enables allotment of the same classes of variables as used above to the tree's nodes.

This function allows user-input of most pvclust parameters, including hclust method and distance methods, as well as parameters to determine what data to input to the tree calculation. You can specify any latent space in the Embeddings() of your seurat object or you can calculate centroids per cluster of a subset of genes, using the parameters tree_reduction='centroids' and gene_list. gene_list defaults to using all genes. Binary tree calculation is much faster if performed on a dimensionality-reduced space - especially useful for large datasets

A basic visualization is added to the @misc slot of the seurat object.

Using the same attributes as in the cluster trees in the previous section, plotting the newly calculated taxonomy shows celltypes grouped closer together

pvclust bootstrapped p-values for branching points are included in the ggraph object, and can be directly visualized on the ggraph (we recommend filtering the lower splits out). Good to keep in mind here that pvclust bootstraps on features (which are our gene-centroids per cluster in this case), which we find not as reflective of certainty as would be a bootstrap of samples

When we calculate the taxonomy using median centroids of all genes, centroid-bootstrap finds higher p-values for many branching points, indicating instability in some T and apical celltype clade placement

We hope to enable ARBOL users to define their own taxonomy calculation methods.

Please reach out with suggestions for tree-building methods that may not be included in pvclust()

Taxonomy visualization

In addition to taxonomy layouts we've used so far, ggraph supports advanced ggplot functions via ggforce and ggbuild. Here we showcase additional methods for interrogating the tree, including methods for superimposing geoms to ggraph builds, alongside ggforce's zoom and marking

These methods are possible in both taxonomy and cluster trees, but require slightly different grammar

zoom can be especially useful for large or raw trees

It's also possible to zoom using coord_cartesian() to crop the graph rather than producing a double graphic.

Both of these methods can be hacked to contain geom_scatterpie() as well

taxonomy building metrics are saved in the @misc slot in case you need them later

Cultivating the taxonomy

ARBOL provides methods for pruning the tree based on any of the variables included in its calculation. These methods merge clusters up into their branch in the tree. Although ARBOL() tiered hierarchical clustering attempts to find the right resolution for clustering analysis, different datasets and analyses call for different resolutions.

For example, with ARBOL you can merge based on a sample diversity threshold, in case end clusters are dominated by sample-specific effects.

There are two functions for merging clusters - the first is more stable than the second but is restricted to only cell number per cluster (size_threshold) and sample diversity (sample_diversity_threshold)

Data.tree objects have a strange property of only existing in one place in an object, so we have to manually save a copy of the tree outside this environment to be able to modulate the tree without losing the old one

After merging clusters, a new tierNident column is placed in the seurat object's metadata. At this point we must re-calculate the tree to make sure nodes that have unique cells are called end-clusters.

If you're unsatisfied with the merge result, reset the seurat object and merge again with different parameters

Finally, it's possible to merge clusters using a cutree() like algorithm, where you choose the number of clusters you want and the tree is cut horizontally to produce that number of clusters, as displayed below. To do so, add a rank to each internal node of the tree and merge based on a boolean of the rank

Automatically annotate high resolution clusters using getStandardNames()

ARBOL's naming function provides users with an automatic method to name sub-clusters. The most important parameter, "celltype_col" tells the function which clade to perform 1 vs. rest Wilcoxon for each endclust. The genes a standard name is based on will be those that make it unique among the clade specified.

figdir allows you to save the wilcoxon results. If there is a getStandardNames() run in the target figdir already, it will re-load that result instead of calculating again.

max_cells_per_ident can speed up wilcoxon calculation substantially. We have noted stability in multiple runs of getStandardNames() even when only using 100 cells per ident. It's also possible to speed up the wilcoxon by calculating names among finer-grained celltype_col clades. If there are fewer clusters per clade, wilcoxon is faster.

finally, standardname_col is the seurat object metadata column into which to place the resulting standard names

There is a catch here. If some cell states have multiple celltype_col membership, then we need to create a different name for that cell state based on each celltype in which it has membership. When this happens, GetStandardNames() will create a separate column in the metadata for each possible celltype's standard name. The standardname_col you specify in the parameter will be filled with the majority celltype-name for that cell state

Cell state abundance analysis